Businesses like banks that provide services have to worry about the problem of 'customer churn', i.e., customers leaving and joining another service provider. It is important to understand which aspects of the service influence a customer's decision in this regard, so that management can concentrate improvement efforts with these priorities in mind.
As a data scientist with the bank, you need to build a neural-network-based classifier that can determine whether a customer will leave the bank or not in the next 6 months.
CustomerId: Unique ID assigned to each customer
Surname: Last name of the customer
CreditScore: Credit score of the customer, summarizing their credit history
Geography: The customer's location
Gender: Gender of the customer
Age: Age of the customer
Tenure: Number of years the customer has been with the bank
NumOfProducts: Number of products the customer has purchased through the bank
Balance: Account balance
HasCrCard: Categorical variable indicating whether the customer has a credit card
EstimatedSalary: Estimated salary of the customer
IsActiveMember: Categorical variable indicating whether the customer is an active member of the bank (i.e., uses bank products regularly, makes transactions, etc.)
Exited: Whether or not the customer left the bank within six months. It can take two values: 0 = No (customer did not leave the bank), 1 = Yes (customer left the bank)
!pip install tensorflow==2.15.0 scikit-learn==1.2.2 matplotlib==3.7.1 seaborn==0.13.1 numpy==1.25.2 pandas==1.5.3 -q --user
# Library for data manipulation and analysis.
import pandas as pd
# Fundamental package for scientific computing.
import numpy as np
#splitting datasets into training and testing sets.
from sklearn.model_selection import train_test_split
#Imports tools for data preprocessing including label encoding, one-hot encoding, and standard scaling
from sklearn.preprocessing import LabelEncoder, OneHotEncoder,StandardScaler
#Imports a class for imputing missing values in datasets.
from sklearn.impute import SimpleImputer
#Imports the Matplotlib library for creating visualizations.
import matplotlib.pyplot as plt
# Imports the Seaborn library for statistical data visualization.
import seaborn as sns
# Time related functions.
import time
#Imports functions for evaluating the performance of machine learning models
from sklearn.metrics import confusion_matrix, f1_score,accuracy_score, recall_score, precision_score, classification_report
#importing SMOTE
from imblearn.over_sampling import SMOTE
import random
#Imports the tensorflow,keras and layers.
import tensorflow as tf
from tensorflow import keras
from keras import backend
from keras.models import Sequential
from keras.layers import Dense, Dropout
# to suppress unnecessary warnings
import warnings
warnings.filterwarnings("ignore")
#Read dataset
data = pd.read_csv('Churn.csv')
# View the first 5 rows of the data
data.head()
| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
# View the last 5 rows of the data
data.tail()
| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9995 | 9996 | 15606229 | Obijiaku | 771 | France | Male | 39 | 5 | 0.00 | 2 | 1 | 0 | 96270.64 | 0 |
| 9996 | 9997 | 15569892 | Johnstone | 516 | France | Male | 35 | 10 | 57369.61 | 1 | 1 | 1 | 101699.77 | 0 |
| 9997 | 9998 | 15584532 | Liu | 709 | France | Female | 36 | 7 | 0.00 | 1 | 0 | 1 | 42085.58 | 1 |
| 9998 | 9999 | 15682355 | Sabbatini | 772 | Germany | Male | 42 | 3 | 75075.31 | 2 | 1 | 0 | 92888.52 | 1 |
| 9999 | 10000 | 15628319 | Walker | 792 | France | Female | 28 | 4 | 130142.79 | 1 | 1 | 0 | 38190.78 | 0 |
# Check number of rows and columns
data.shape
(10000, 14)
# check the datatypes of the columns in the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   RowNumber        10000 non-null  int64
 1   CustomerId       10000 non-null  int64
 2   Surname          10000 non-null  object
 3   CreditScore      10000 non-null  int64
 4   Geography        10000 non-null  object
 5   Gender           10000 non-null  object
 6   Age              10000 non-null  int64
 7   Tenure           10000 non-null  int64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64
 10  HasCrCard        10000 non-null  int64
 11  IsActiveMember   10000 non-null  int64
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB
# check for duplicate values in the data
data.duplicated().sum()
0
# check for missing values in the data
round(data.isnull().sum() / data.isnull().count() * 100, 2)
RowNumber          0.0
CustomerId         0.0
Surname            0.0
CreditScore        0.0
Geography          0.0
Gender             0.0
Age                0.0
Tenure             0.0
Balance            0.0
NumOfProducts      0.0
HasCrCard          0.0
IsActiveMember     0.0
EstimatedSalary    0.0
Exited             0.0
dtype: float64
# Check the class distribution of the target variable
data["Exited"].value_counts(normalize=True)
0    0.7963
1    0.2037
Name: Exited, dtype: float64
# statistical summary of the numerical columns in the data
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| RowNumber | 10000.0 | 5.000500e+03 | 2886.895680 | 1.00 | 2500.75 | 5.000500e+03 | 7.500250e+03 | 10000.00 |
| CustomerId | 10000.0 | 1.569094e+07 | 71936.186123 | 15565701.00 | 15628528.25 | 1.569074e+07 | 1.575323e+07 | 15815690.00 |
| CreditScore | 10000.0 | 6.505288e+02 | 96.653299 | 350.00 | 584.00 | 6.520000e+02 | 7.180000e+02 | 850.00 |
| Age | 10000.0 | 3.892180e+01 | 10.487806 | 18.00 | 32.00 | 3.700000e+01 | 4.400000e+01 | 92.00 |
| Tenure | 10000.0 | 5.012800e+00 | 2.892174 | 0.00 | 3.00 | 5.000000e+00 | 7.000000e+00 | 10.00 |
| Balance | 10000.0 | 7.648589e+04 | 62397.405202 | 0.00 | 0.00 | 9.719854e+04 | 1.276442e+05 | 250898.09 |
| NumOfProducts | 10000.0 | 1.530200e+00 | 0.581654 | 1.00 | 1.00 | 1.000000e+00 | 2.000000e+00 | 4.00 |
| HasCrCard | 10000.0 | 7.055000e-01 | 0.455840 | 0.00 | 0.00 | 1.000000e+00 | 1.000000e+00 | 1.00 |
| IsActiveMember | 10000.0 | 5.151000e-01 | 0.499797 | 0.00 | 0.00 | 1.000000e+00 | 1.000000e+00 | 1.00 |
| EstimatedSalary | 10000.0 | 1.000902e+05 | 57510.492818 | 11.58 | 51002.11 | 1.001939e+05 | 1.493882e+05 | 199992.48 |
| Exited | 10000.0 | 2.037000e-01 | 0.402769 | 0.00 | 0.00 | 0.000000e+00 | 0.000000e+00 | 1.00 |
# Check the number of unique values in each column
data.nunique()
RowNumber          10000
CustomerId         10000
Surname             2932
CreditScore          460
Geography              3
Gender                 2
Age                   70
Tenure                11
Balance             6382
NumOfProducts          4
HasCrCard              2
IsActiveMember         2
EstimatedSalary     9999
Exited                 2
dtype: int64
for i in data.describe(include=["object"]).columns:
print("Unique values in", i, "are :")
print(data[i].value_counts())
print("*" * 50)
Unique values in Surname are :
Smith 32
Martin 29
Scott 29
Walker 28
Brown 26
..
Wells 1
Calzada 1
Gresswell 1
Aguirre 1
Morales 1
Name: Surname, Length: 2932, dtype: int64
**************************************************
Unique values in Geography are :
France 5014
Germany 2509
Spain 2477
Name: Geography, dtype: int64
**************************************************
Unique values in Gender are :
Male 5457
Female 4543
Name: Gender, dtype: int64
**************************************************
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to the show density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
plt.legend(
loc="lower left", frameon=False,
)
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
### Function to plot distributions
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
num_col_sel = data.select_dtypes(include=np.number).columns.tolist()
for item in num_col_sel:
histogram_boxplot(data, item)
The median estimated salary is about 100,000.
data['CustomerId'].nunique()
10000
CustomerId is unique for each customer.
data['Surname'].nunique()
2932
data['Surname'].value_counts()
Smith 32
Martin 29
Scott 29
Walker 28
Brown 26
..
Wells 1
Calzada 1
Gresswell 1
Aguirre 1
Morales 1
Name: Surname, Length: 2932, dtype: int64
32 customers had surname "Smith".
data['CreditScore'].nunique()
460
data['CreditScore'].value_counts()
850 233
678 63
655 54
705 53
667 53
...
351 1
365 1
382 1
373 1
419 1
Name: CreditScore, Length: 460, dtype: int64
sns.boxplot(data=data,x='CreditScore')
# Boxplot to show the distribution of CreditScore
<Axes: xlabel='CreditScore'>
Credit score has many outliers on the lower end but the mean falls around 650.
data['Geography'].nunique()
3
data['Geography'].value_counts()
France     5014
Germany    2509
Spain      2477
Name: Geography, dtype: int64
labeled_barplot(data,'Geography', perc=True)
50.1% of the customers are from France. The three countries that customers belong to are France, Germany, and Spain.
data['Gender'].nunique()
2
data['Gender'].value_counts()
Male      5457
Female    4543
Name: Gender, dtype: int64
labeled_barplot(data,'Gender', perc=True)
data['Age'].nunique()
70
data['Age'].value_counts()
37 478
38 477
35 474
36 456
34 447
...
84 2
88 1
82 1
85 1
83 1
Name: Age, Length: 70, dtype: int64
The mode for customer age is 37.
sns.boxplot(data=data,x='Age')
#Boxplot to show the distribution of Customer_Age
<Axes: xlabel='Age'>
data['Tenure'].nunique()
11
data['Tenure'].value_counts()
2     1048
1     1035
7     1028
8     1025
5     1012
3     1009
4      989
9      984
6      967
10     490
0      413
Name: Tenure, dtype: int64
sns.boxplot(data=data,x='Tenure')
# Boxplot to show the distribution of Tenure
<Axes: xlabel='Tenure'>
50% of the customers have stayed for a tenure of 3 to 7 years.
data['NumOfProducts'].nunique()
4
data['NumOfProducts'].value_counts()
1    5084
2    4590
3     266
4      60
Name: NumOfProducts, dtype: int64
sns.boxplot(data=data,x='NumOfProducts')
# Boxplot to show the distribution of NumOfProducts
<Axes: xlabel='NumOfProducts'>
75% of the customers have 2 products or fewer.
data['Balance'].nunique()
6382
data['Balance'].value_counts()
0.00 3617
130170.82 2
105473.74 2
159397.75 1
144238.70 1
...
108698.96 1
238387.56 1
111833.47 1
126619.27 1
138734.94 1
Name: Balance, Length: 6382, dtype: int64
sns.boxplot(data=data,x='Balance')
# Boxplot to show the distribution of Balance
<Axes: xlabel='Balance'>
data['HasCrCard'].nunique()
2
data['HasCrCard'].value_counts()
1    7055
0    2945
Name: HasCrCard, dtype: int64
labeled_barplot(data,'HasCrCard', perc=True)
Over 70% of the customers have a credit card.
data['EstimatedSalary'].nunique()
9999
data['EstimatedSalary'].value_counts()
24924.92 2
121505.61 1
89874.82 1
72500.68 1
182692.80 1
..
188377.21 1
55902.93 1
4523.74 1
102195.16 1
2465.80 1
Name: EstimatedSalary, Length: 9999, dtype: int64
sns.boxplot(data=data,x='EstimatedSalary')
# Boxplot to show the distribution of EstimatedSalary
<Axes: xlabel='EstimatedSalary'>
50% of the customers have an estimated salary of about 100k or more.
data['IsActiveMember'].nunique()
2
data['IsActiveMember'].value_counts()
1    5151
0    4849
Name: IsActiveMember, dtype: int64
labeled_barplot(data,'IsActiveMember', perc=True)
48.5% of the members are not active.
data['Exited'].nunique()
2
data['Exited'].value_counts()
0    7963
1    2037
Name: Exited, dtype: int64
labeled_barplot(data,'Exited', perc=True)
20.4% of the customers have exited.
num_col = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 7))
sns.heatmap(data[num_col].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Age shows a relatively strong positive correlation with Exited. IsActiveMember shows a negative correlation with Exited. Balance has a possible weak positive correlation with Exited.
sns.pairplot(data=data[num_col], diag_kind="kde")
plt.show()
data['Surname'].value_counts(normalize=True)
Smith 0.0032
Martin 0.0029
Scott 0.0029
Walker 0.0028
Brown 0.0026
...
Wells 0.0001
Calzada 0.0001
Gresswell 0.0001
Aguirre 0.0001
Morales 0.0001
Name: Surname, Length: 2932, dtype: float64
distribution_plot_wrt_target(data, "CreditScore", "Exited")
There are more outliers with credit score below 400 for customers that left the bank.
stacked_barplot(data, "Geography", "Exited")
Exited        0     1    All
Geography
All        7963  2037  10000
Germany    1695   814   2509
France     4204   810   5014
Spain      2064   413   2477
------------------------------------------------------------------------------------------------------------------------
Although the largest customer base is in France, the highest number of customers who left were based in Germany.
stacked_barplot(data, "Gender", "Exited")
Exited     0     1    All
Gender
All     7963  2037  10000
Female  3404  1139   4543
Male    4559   898   5457
------------------------------------------------------------------------------------------------------------------------
Higher percentage of females exited the bank.
distribution_plot_wrt_target(data, "Age", "Exited")
50% of the customers who exited are of age 45 or older, while more than 75% of the customers who did not exit are younger than 45.
distribution_plot_wrt_target(data, "Tenure", "Exited")
stacked_barplot(data, "NumOfProducts", "Exited")
Exited            0     1    All
NumOfProducts
All            7963  2037  10000
1              3675  1409   5084
2              4242   348   4590
3                46   220    266
4                 0    60     60
------------------------------------------------------------------------------------------------------------------------
Customers with 3 or 4 products churned at very high rates: all 60 customers with 4 products exited, while customers with 2 products had the lowest churn rate.
distribution_plot_wrt_target(data, "Balance", "Exited")
stacked_barplot(data, "HasCrCard", "Exited")
Exited        0     1    All
HasCrCard
All        7963  2037  10000
1          5631  1424   7055
0          2332   613   2945
------------------------------------------------------------------------------------------------------------------------
distribution_plot_wrt_target(data, "EstimatedSalary", "Exited")
stacked_barplot(data, "IsActiveMember", "Exited")
Exited            0     1    All
IsActiveMember
All            7963  2037  10000
0              3547  1302   4849
1              4416   735   5151
------------------------------------------------------------------------------------------------------------------------
# Drop column "CustomerId" as it is unique for each customer and will not add value to the modeling
data.drop(['CustomerId'], axis=1, inplace=True)
#Separate target and dependent Columns
X = data.drop(['Exited'],axis=1)
Y = data['Exited']
X.columns
Index(['RowNumber', 'Surname', 'CreditScore', 'Geography', 'Gender', 'Age',
'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember',
'EstimatedSalary'],
dtype='object')
# Calculate the total number of null values for each column
X.isnull().sum()
RowNumber 0 Surname 0 CreditScore 0 Geography 0 Gender 0 Age 0 Tenure 0 Balance 0 NumOfProducts 0 HasCrCard 0 IsActiveMember 0 EstimatedSalary 0 dtype: int64
There are no missing values.
#Encode categorical variables using one-hot encoding
X = pd.get_dummies(
X,
columns=X.select_dtypes(include=["object"]).columns.tolist(),
drop_first=True, dtype= float
)
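To see what `drop_first=True` does, here is a small sketch on a hypothetical toy frame (not the bank data): each categorical column with k levels becomes k-1 dummy columns, dropping the first (alphabetical) level as the implicit baseline to avoid perfect collinearity.

```python
import pandas as pd

# Hypothetical toy frame to illustrate one-hot encoding with drop_first
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
})
encoded = pd.get_dummies(toy, columns=["Geography", "Gender"],
                         drop_first=True, dtype=float)
# Geography_France and Gender_Female become the implicit baselines:
# a row of all zeros in the Geography dummies means "France"
print(encoded.columns.tolist())
# → ['Geography_Germany', 'Geography_Spain', 'Gender_Male']
```

A row is the baseline category exactly when all of its dummies are zero, so no information is lost by dropping one column per categorical variable.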
#Split dataset into the Training set and Test set.
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size = 0.2, random_state = 42,stratify = Y)
# Split Train dataset into the Training and Validation sets.
X_train, X_valid, y_train, y_valid = train_test_split(X_train,y_train, test_size = 0.2, random_state = 42,stratify = y_train)
print("Number of rows in train data =", X_train.shape[0])
print("Number of rows in validation data =", X_valid.shape[0])
print("Number of rows in test data =", X_test.shape[0])
Number of rows in train data = 6400
Number of rows in validation data = 1600
Number of rows in test data = 2000
print("Number of rows in train data =", y_train.shape[0])
print("Number of rows in validation data =", y_valid.shape[0])
print("Number of rows in test data =", y_test.shape[0])
Number of rows in train data = 6400
Number of rows in validation data = 1600
Number of rows in test data = 2000
# Standardize the numerical variables
num_col = X.select_dtypes(include=np.number).columns.tolist()
transformer = StandardScaler()
# Fit the scaler on the training data only, then apply the same learned
# transformation to the validation and test sets (avoids data leakage)
X_train[num_col] = transformer.fit_transform(X_train[num_col])
X_valid[num_col] = transformer.transform(X_valid[num_col])
X_test[num_col] = transformer.transform(X_test[num_col])
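The same fit-on-train, transform-on-the-rest discipline can also be expressed with a scikit-learn `Pipeline`; a minimal sketch on synthetic data (not the bank data) is shown below.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr = rng.normal(50, 10, (100, 3))   # synthetic "train" matrix
X_te = rng.normal(50, 10, (20, 3))    # synthetic "test" matrix

pipe = Pipeline([("scale", StandardScaler())])
Z_tr = pipe.fit_transform(X_tr)   # fit AND transform on train only
Z_te = pipe.transform(X_te)       # transform only on test
# Train columns are standardized exactly; test columns are scaled with
# the train statistics, so their means need not be exactly zero
print(Z_tr.mean(axis=0).round(6))
```

Bundling the scaler into a pipeline makes it impossible to accidentally refit on validation or test data.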
Write down the logic for choosing the metric that would be the best metric for this business scenario.

Recall is the best metric for this situation because of its ability to give adequate importance to the minority class. Since the exited class is the minority, recall is the better choice: a high recall ensures that actual churners (the positive cases) are not overlooked, which is what matters most to the bank.
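A toy illustration (made-up numbers, not the bank data) of why accuracy can be misleading here while recall exposes the problem:

```python
from sklearn.metrics import accuracy_score, recall_score

y_true = [0] * 80 + [1] * 20   # 20% positive class, like Exited
y_pred = [0] * 100             # a "model" that always predicts no churn

# Accuracy looks deceptively good because the majority class dominates
print(accuracy_score(y_true, y_pred))  # → 0.8
# Recall reveals that every single churner was missed
print(recall_score(y_true, y_pred))    # → 0.0
```

A classifier optimized for accuracy alone could "earn" 80% by ignoring churners entirely, which is exactly the failure mode recall penalizes.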
# defining a function to compute different metrics to check performance of a classification model
def model_performance_classification(
model, predictors, target, threshold=0.5
):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
# checking which probabilities are greater than threshold
pred = model.predict(predictors) > threshold
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred, average='weighted') # to compute Recall
precision = precision_score(target, pred, average='weighted') # to compute Precision
f1 = f1_score(target, pred, average='weighted') # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1 Score": f1,},
index=[0],
)
return df_perf
def plot(history, name):
"""
Function to plot loss/accuracy
history: an object which stores the metrics and losses.
    name: metric to plot, e.g. 'loss' or 'recall'
"""
fig, ax = plt.subplots() #Creating a subplot with figure and axes.
plt.plot(history.history[name]) #Plotting the train accuracy or train loss
plt.plot(history.history['val_'+name]) #Plotting the validation accuracy or validation loss
plt.title('Model ' + name.capitalize()) #Defining the title of the plot.
plt.ylabel(name.capitalize()) #Capitalizing the first letter.
plt.xlabel('Epoch') #Defining the label for the x-axis.
fig.legend(['Train', 'Validation'], loc="outside right upper") #Defining the legend, loc controls the position of the legend.
def make_confusion_matrix(actual_targets, predicted_targets):
"""
To plot the confusion_matrix with percentages
actual_targets: actual target (dependent) variable values
predicted_targets: predicted target (dependent) variable values
"""
cm = confusion_matrix(actual_targets, predicted_targets)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(cm.shape[0], cm.shape[1])
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
train_data=pd.DataFrame(columns=["recall"])
valid_data=pd.DataFrame(columns=["recall"])
# Calculate class weights for the imbalanced dataset
cw = (y_train.shape[0]) / np.bincount(y_train)
# Create a dictionary mapping class indices to their respective class weights
cw_dict = {}
for i in range(cw.shape[0]):
cw_dict[i] = cw[i]
cw_dict
{0: 1.2558869701726845, 1: 4.9079754601226995}
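For comparison, scikit-learn's helper produces the same weights up to a constant factor: `'balanced'` uses n_samples / (n_classes * count), i.e. exactly half of the n_samples / count values computed above. A sketch using the training class counts shown in the classification reports below (5096 and 1304):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Reconstruct a label vector with the train class counts (5096 zeros, 1304 ones)
y = np.array([0] * 5096 + [1] * 1304)

# 'balanced' weight for class c = n_samples / (n_classes * count_c)
w = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], w.round(4))))
# Each weight is exactly half of the cw_dict value above, since
# n_samples / count = n_classes * (n_samples / (n_classes * count))
```

Because only the ratio between class weights matters for reweighting the loss, both versions have the same effect on training.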
# defining the batch size and number of epochs upfront as we'll be using the same values for all models
epochs = 25
batch_size = 65
backend.clear_session()
np.random.seed(2)
random.seed(2)
tf.random.set_seed(2)
#Initializing the neural network
model0 = Sequential()
model0.add(Dense(70,activation="relu",input_dim=X_train.shape[1]))
model0.add(Dense(17,activation="relu"))
model0.add(Dense(1,activation="sigmoid"))
model0.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 70) 206080
dense_1 (Dense) (None, 17) 1207
dense_2 (Dense) (None, 1) 18
=================================================================
Total params: 207305 (809.79 KB)
Trainable params: 207305 (809.79 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
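The parameter counts in the summary follow directly from the formula for a fully connected layer: params = (input_dim + 1) × units (weights plus one bias per unit). Working backwards from the first layer's 206,080 parameters shows the network sees 2,943 input features, which is dominated by the one-hot-encoded Surname dummies.

```python
# Parameters of a Dense layer: one weight per input per unit, plus one bias per unit
def dense_params(input_dim, units):
    return (input_dim + 1) * units

assert dense_params(2943, 70) == 206080   # first hidden layer (input_dim inferred)
assert dense_params(70, 17) == 1207       # second hidden layer
assert dense_params(17, 1) == 18          # output layer
print(dense_params(2943, 70) + dense_params(70, 17) + dense_params(17, 1))
# → 207305, matching "Total params" in the summary
```

The very wide input layer is worth noting: almost all of the model's capacity is spent on the Surname dummies rather than on the behavioral features.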
optimizer = tf.keras.optimizers.SGD(0.001) # defining SGD as the optimizer to be used
metric = keras.metrics.Recall()
model0.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=[metric])
start = time.time()
history = model0.fit(X_train, y_train, validation_data=(X_valid,y_valid) , batch_size=batch_size, epochs=epochs,class_weight=cw_dict)
end=time.time()
Epoch 1/25
99/99 [==============================] - 6s 14ms/step - loss: 1.5906 - recall: 0.7400 - val_loss: 0.7870 - val_recall: 0.6779
Epoch 2/25
99/99 [==============================] - 0s 5ms/step - loss: 1.4873 - recall: 0.6426 - val_loss: 0.7406 - val_recall: 0.5613
Epoch 3/25
99/99 [==============================] - 1s 5ms/step - loss: 1.4291 - recall: 0.5890 - val_loss: 0.7214 - val_recall: 0.4969
Epoch 4/25
99/99 [==============================] - 0s 5ms/step - loss: 1.3840 - recall: 0.6043 - val_loss: 0.7078 - val_recall: 0.4877
Epoch 5/25
99/99 [==============================] - 1s 5ms/step - loss: 1.3461 - recall: 0.6288 - val_loss: 0.6961 - val_recall: 0.4663
Epoch 6/25
99/99 [==============================] - 0s 5ms/step - loss: 1.3133 - recall: 0.6327 - val_loss: 0.6903 - val_recall: 0.4509
Epoch 7/25
99/99 [==============================] - 0s 5ms/step - loss: 1.2837 - recall: 0.6511 - val_loss: 0.6855 - val_recall: 0.4356
Epoch 8/25
99/99 [==============================] - 0s 5ms/step - loss: 1.2565 - recall: 0.6741 - val_loss: 0.6790 - val_recall: 0.4294
Epoch 9/25
99/99 [==============================] - 0s 5ms/step - loss: 1.2313 - recall: 0.6848 - val_loss: 0.6737 - val_recall: 0.4110
Epoch 10/25
99/99 [==============================] - 0s 5ms/step - loss: 1.2074 - recall: 0.6856 - val_loss: 0.6724 - val_recall: 0.4294
Epoch 11/25
99/99 [==============================] - 0s 5ms/step - loss: 1.1848 - recall: 0.7040 - val_loss: 0.6680 - val_recall: 0.4294
Epoch 12/25
99/99 [==============================] - 0s 5ms/step - loss: 1.1630 - recall: 0.7178 - val_loss: 0.6637 - val_recall: 0.4141
Epoch 13/25
99/99 [==============================] - 0s 5ms/step - loss: 1.1417 - recall: 0.7301 - val_loss: 0.6585 - val_recall: 0.3988
Epoch 14/25
99/99 [==============================] - 1s 7ms/step - loss: 1.1214 - recall: 0.7377 - val_loss: 0.6538 - val_recall: 0.3834
Epoch 15/25
99/99 [==============================] - 1s 6ms/step - loss: 1.1015 - recall: 0.7393 - val_loss: 0.6506 - val_recall: 0.3773
Epoch 16/25
99/99 [==============================] - 1s 7ms/step - loss: 1.0821 - recall: 0.7508 - val_loss: 0.6476 - val_recall: 0.3773
Epoch 17/25
99/99 [==============================] - 1s 7ms/step - loss: 1.0631 - recall: 0.7577 - val_loss: 0.6459 - val_recall: 0.3834
Epoch 18/25
99/99 [==============================] - 1s 7ms/step - loss: 1.0447 - recall: 0.7661 - val_loss: 0.6430 - val_recall: 0.3865
Epoch 19/25
99/99 [==============================] - 0s 5ms/step - loss: 1.0266 - recall: 0.7730 - val_loss: 0.6398 - val_recall: 0.3896
Epoch 20/25
99/99 [==============================] - 0s 5ms/step - loss: 1.0091 - recall: 0.7738 - val_loss: 0.6383 - val_recall: 0.3957
Epoch 21/25
99/99 [==============================] - 0s 5ms/step - loss: 0.9921 - recall: 0.7807 - val_loss: 0.6370 - val_recall: 0.3926
Epoch 22/25
99/99 [==============================] - 0s 4ms/step - loss: 0.9754 - recall: 0.7899 - val_loss: 0.6317 - val_recall: 0.3896
Epoch 23/25
99/99 [==============================] - 0s 5ms/step - loss: 0.9594 - recall: 0.7899 - val_loss: 0.6309 - val_recall: 0.3896
Epoch 24/25
99/99 [==============================] - 0s 5ms/step - loss: 0.9435 - recall: 0.7991 - val_loss: 0.6284 - val_recall: 0.3865
Epoch 25/25
99/99 [==============================] - 0s 5ms/step - loss: 0.9284 - recall: 0.8029 - val_loss: 0.6257 - val_recall: 0.3865
print("Time taken in seconds ",end-start)
Time taken in seconds 19.47603940963745
plot(history,'loss')
y_train_pred = model0.predict(X_train)
y_train_pred = (y_train_pred > 0.5)
y_train_pred
200/200 [==============================] - 1s 3ms/step
array([[ True],
[False],
[False],
...,
[False],
[ True],
[False]])
y_valid_pred = model0.predict(X_valid)
y_valid_pred = (y_valid_pred > 0.5)
y_valid_pred
50/50 [==============================] - 0s 5ms/step
array([[False],
[False],
[False],
...,
[ True],
[ True],
[ True]])
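Both models classify an observation as a churner when the predicted probability exceeds 0.5. A sketch with synthetic probabilities (made up for illustration, not the model's actual output) of how lowering that threshold trades precision for recall:

```python
import numpy as np
from sklearn.metrics import recall_score

rng = np.random.default_rng(1)
y_true = (rng.random(1000) < 0.2).astype(int)        # ~20% positives
# Crude synthetic probabilities loosely correlated with the label
probs = np.clip(0.3 * y_true + rng.random(1000) * 0.6, 0, 1)

# Lowering the cutoff flags more observations as churners, so the set of
# predicted positives grows and recall can only rise (or stay the same)
for t in (0.5, 0.4, 0.3):
    preds = (probs > t).astype(int)
    print(t, round(recall_score(y_true, preds), 3))
```

When false negatives are the costly error, as here, tuning the threshold below 0.5 on the validation set is a cheap lever alongside class weights.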
cl_rp = classification_report(y_train, y_train_pred)
print(cl_rp)
precision recall f1-score support
0 0.94 0.79 0.86 5096
1 0.50 0.82 0.62 1304
accuracy 0.79 6400
macro avg 0.72 0.80 0.74 6400
weighted avg 0.85 0.79 0.81 6400
cl_rp = classification_report(y_valid, y_valid_pred)
print(cl_rp)
precision recall f1-score support
0 0.81 0.69 0.75 1274
1 0.24 0.39 0.30 326
accuracy 0.63 1600
macro avg 0.53 0.54 0.52 1600
weighted avg 0.70 0.63 0.66 1600
modelName="NN SGD"
train_data.loc[modelName] = recall_score(y_train, y_train_pred)
valid_data.loc[modelName] = recall_score(y_valid, y_valid_pred)
make_confusion_matrix(y_train, y_train_pred)
make_confusion_matrix(y_valid, y_valid_pred)
Recall for both the training set and the validation set can be improved. The training loss is still decreasing at epoch 25, which suggests an underfitted model, and the gap between training recall (~0.82) and validation recall (~0.39) also points to poor generalization.
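One common remedy worth trying (separate from switching the optimizer, which the next model does) is adding Dropout between the Dense layers to randomly silence units during training. A minimal sketch, assuming the same 2,943-feature input width as the summary above; this is an illustration, not the notebook's next model:

```python
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout

reg_model = Sequential([
    Dense(70, activation="relu", input_dim=2943),  # input width assumed from summary
    Dropout(0.2),            # drop 20% of activations each training step
    Dense(17, activation="relu"),
    Dropout(0.2),
    Dense(1, activation="sigmoid"),
])
reg_model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["Recall"])
# Dropout adds no trainable parameters, so the count matches the summary
print(reg_model.count_params())  # → 207305
```

Dropout is only active during training; at inference time all units fire and activations are rescaled automatically.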
backend.clear_session()
np.random.seed(2)
random.seed(2)
tf.random.set_seed(2)
#Initializing the neural network
model = Sequential()
model.add(Dense(70,activation="relu",input_dim=X_train.shape[1]))
model.add(Dense(17,activation="relu"))
model.add(Dense(1,activation="sigmoid"))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 70) 206080
dense_1 (Dense) (None, 17) 1207
dense_2 (Dense) (None, 1) 18
=================================================================
Total params: 207305 (809.79 KB)
Trainable params: 207305 (809.79 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
optimizer = tf.keras.optimizers.Adam() # defining Adam as the optimizer to be used
metric = keras.metrics.Recall()
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=[metric])
start = time.time()
history = model.fit(X_train, y_train, validation_data=(X_valid,y_valid) , batch_size=batch_size, epochs=epochs,class_weight=cw_dict)
end=time.time()
Epoch 1/25
99/99 [==============================] - 2s 8ms/step - loss: 1.5271 - recall: 0.5023 - val_loss: 0.7103 - val_recall: 0.6350
Epoch 2/25
99/99 [==============================] - 1s 7ms/step - loss: 1.0437 - recall: 0.7945 - val_loss: 0.6428 - val_recall: 0.4785
Epoch 3/25
99/99 [==============================] - 1s 7ms/step - loss: 0.8643 - recall: 0.8090 - val_loss: 0.6878 - val_recall: 0.5368
Epoch 4/25
99/99 [==============================] - 1s 8ms/step - loss: 0.7701 - recall: 0.8459 - val_loss: 0.6950 - val_recall: 0.5613
Epoch 5/25
99/99 [==============================] - 1s 8ms/step - loss: 0.7176 - recall: 0.8597 - val_loss: 0.6614 - val_recall: 0.5368
Epoch 6/25
99/99 [==============================] - 1s 9ms/step - loss: 0.6704 - recall: 0.8612 - val_loss: 0.6736 - val_recall: 0.5460
Epoch 7/25
99/99 [==============================] - 1s 8ms/step - loss: 0.6408 - recall: 0.8873 - val_loss: 0.6704 - val_recall: 0.5368
Epoch 8/25
99/99 [==============================] - 1s 7ms/step - loss: 0.6012 - recall: 0.8934 - val_loss: 0.7028 - val_recall: 0.5798
Epoch 9/25
99/99 [==============================] - 1s 7ms/step - loss: 0.5700 - recall: 0.9087 - val_loss: 0.7018 - val_recall: 0.5644
Epoch 10/25
99/99 [==============================] - 0s 5ms/step - loss: 0.5349 - recall: 0.9149 - val_loss: 0.7486 - val_recall: 0.6012
Epoch 11/25
99/99 [==============================] - 0s 5ms/step - loss: 0.5027 - recall: 0.9241 - val_loss: 0.7366 - val_recall: 0.5920
Epoch 12/25
99/99 [==============================] - 0s 5ms/step - loss: 0.4707 - recall: 0.9394 - val_loss: 0.7670 - val_recall: 0.5982
Epoch 13/25
99/99 [==============================] - 0s 5ms/step - loss: 0.4320 - recall: 0.9517 - val_loss: 0.7595 - val_recall: 0.5890
Epoch 14/25
99/99 [==============================] - 0s 5ms/step - loss: 0.3945 - recall: 0.9555 - val_loss: 0.7474 - val_recall: 0.5706
Epoch 15/25
99/99 [==============================] - 0s 5ms/step - loss: 0.3568 - recall: 0.9578 - val_loss: 0.7983 - val_recall: 0.5828
Epoch 16/25
99/99 [==============================] - 0s 5ms/step - loss: 0.3180 - recall: 0.9663 - val_loss: 0.8367 - val_recall: 0.5920
Epoch 17/25
99/99 [==============================] - 0s 5ms/step - loss: 0.2817 - recall: 0.9747 - val_loss: 0.8649 - val_recall: 0.5828
Epoch 18/25
99/99 [==============================] - 0s 5ms/step - loss: 0.2574 - recall: 0.9709 - val_loss: 0.8549 - val_recall: 0.5706
Epoch 19/25
99/99 [==============================] - 0s 5ms/step - loss: 0.2206 - recall: 0.9793 - val_loss: 0.9081 - val_recall: 0.5675
Epoch 20/25
99/99 [==============================] - 1s 9ms/step - loss: 0.1857 - recall: 0.9839 - val_loss: 0.9439 - val_recall: 0.5644
Epoch 21/25
99/99 [==============================] - 1s 10ms/step - loss: 0.1655 - recall: 0.9893 - val_loss: 1.0050 - val_recall: 0.5583
Epoch 22/25
99/99 [==============================] - 1s 9ms/step - loss: 0.1472 - recall: 0.9854 - val_loss: 0.9935 - val_recall: 0.5399
Epoch 23/25
99/99 [==============================] - 1s 12ms/step - loss: 0.1254 - recall: 0.9931 - val_loss: 1.0196 - val_recall: 0.5399
Epoch 24/25
99/99 [==============================] - 1s 14ms/step - loss: 0.1057 - recall: 0.9939 - val_loss: 1.0286 - val_recall: 0.5245
Epoch 25/25
99/99 [==============================] - 1s 11ms/step - loss: 0.0906 - recall: 0.9954 - val_loss: 1.0719 - val_recall: 0.5337
print("Time taken in seconds ",end-start)
Time taken in seconds 20.29223370552063
plot(history,'loss')
y_train_pred = model.predict(X_train)
y_train_pred = (y_train_pred > 0.5)
y_train_pred
200/200 [==============================] - 1s 3ms/step
array([[ True],
[False],
[False],
...,
[False],
[ True],
[False]])
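model.predict returns sigmoid probabilities in [0, 1]; the comparison above simply thresholds them at 0.5 to obtain hard labels. A minimal sketch of the same rule on made-up probabilities (the values are illustrative only, not real model output):

```python
import numpy as np

# Illustrative sigmoid outputs for five customers (not real model output)
probs = np.array([[0.91], [0.12], [0.47], [0.50], [0.73]])

# Same rule as in the notebook: strictly greater than 0.5 -> predicted to exit
labels = probs > 0.5
print(labels.ravel())  # note that exactly 0.50 falls on the "stay" side
```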
y_valid_pred = model.predict(X_valid)
y_valid_pred = (y_valid_pred > 0.5)
y_valid_pred
50/50 [==============================] - 0s 2ms/step
array([[False],
[False],
[False],
...,
[False],
[ True],
[False]])
cl_rp = classification_report(y_train, y_train_pred)
print(cl_rp)
precision recall f1-score support
0 1.00 0.99 0.99 5096
1 0.95 1.00 0.97 1304
accuracy 0.99 6400
macro avg 0.97 0.99 0.98 6400
weighted avg 0.99 0.99 0.99 6400
cl_rp = classification_report(y_valid, y_valid_pred)
print(cl_rp)
precision recall f1-score support
0 0.87 0.77 0.82 1274
1 0.38 0.53 0.44 326
accuracy 0.72 1600
macro avg 0.62 0.65 0.63 1600
weighted avg 0.77 0.72 0.74 1600
modelName="NN Adam"
train_data.loc[modelName] = recall_score(y_train, y_train_pred)
valid_data.loc[modelName] = recall_score(y_valid, y_valid_pred)
make_confusion_matrix(y_train, y_train_pred)
make_confusion_matrix(y_valid, y_valid_pred)
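make_confusion_matrix is a helper defined earlier in the notebook. A plausible reconstruction, assuming the usual sklearn + seaborn pattern (the styling choices here are guesses, not the notebook's exact code):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

def make_confusion_matrix(y_true, y_pred):
    """Plot a labelled confusion matrix for the churn classifier."""
    cm = confusion_matrix(y_true, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Not Exited', 'Exited'],
                yticklabels=['Not Exited', 'Exited'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()
```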
NN Adam improved recall on both the training and validation sets compared to NN SGD.
backend.clear_session()
np.random.seed(2)
random.seed(2)
tf.random.set_seed(2)
#Initializing the neural network
model1 = Sequential()
model1.add(Dense(35,activation="relu",input_dim=X_train.shape[1]))
model1.add(Dropout(0.4))
model1.add(Dense(7,activation="relu"))
model1.add(Dropout(0.2))
model1.add(Dense(1,activation="sigmoid"))
model1.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 35) 103040
dropout (Dropout) (None, 35) 0
dense_1 (Dense) (None, 7) 252
dropout_1 (Dropout) (None, 7) 0
dense_2 (Dense) (None, 1) 8
=================================================================
Total params: 103300 (403.52 KB)
Trainable params: 103300 (403.52 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
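The 103,040 parameters reported for the first Dense layer imply that the one-hot-encoded input has 2,943 features, since a Dense layer holds inputs × units weights plus one bias per unit (Dropout layers add none). A quick arithmetic check, assuming that input width:

```python
n_features = 2943                 # input width inferred from the summary above
dense_35 = n_features * 35 + 35   # weights + biases for Dense(35)
dense_7 = 35 * 7 + 7              # Dense(7)
dense_1 = 7 * 1 + 1               # Dense(1)

print(dense_35, dense_7, dense_1)     # 103040 252 8
print(dense_35 + dense_7 + dense_1)   # 103300, matching "Total params"
```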
optimizer = tf.keras.optimizers.Adam() # defining Adam as the optimizer to be used
metric = keras.metrics.Recall()
model1.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=[metric])
start = time.time()
history = model1.fit(X_train, y_train, validation_data=(X_valid,y_valid) , batch_size=batch_size, epochs=epochs,class_weight=cw_dict)
end=time.time()
Epoch 1/25 - 99/99 - 4s 7ms/step - loss: 1.8455 - recall: 0.5897 - val_loss: 0.6725 - val_recall: 0.2791
Epoch 2/25 - 99/99 - 0s 5ms/step - loss: 1.4818 - recall: 0.4701 - val_loss: 0.6574 - val_recall: 0.2822
Epoch 3/25 - 99/99 - 0s 5ms/step - loss: 1.3689 - recall: 0.5215 - val_loss: 0.6477 - val_recall: 0.3129
Epoch 4/25 - 99/99 - 1s 6ms/step - loss: 1.2457 - recall: 0.6495 - val_loss: 0.6402 - val_recall: 0.4080
Epoch 5/25 - 99/99 - 1s 5ms/step - loss: 1.1704 - recall: 0.6879 - val_loss: 0.6253 - val_recall: 0.4356
Epoch 6/25 - 99/99 - 0s 5ms/step - loss: 1.1206 - recall: 0.7025 - val_loss: 0.6259 - val_recall: 0.5000
Epoch 7/25 - 99/99 - 1s 8ms/step - loss: 1.0504 - recall: 0.7393 - val_loss: 0.6225 - val_recall: 0.5337
Epoch 8/25 - 99/99 - 1s 7ms/step - loss: 0.9739 - recall: 0.7822 - val_loss: 0.6207 - val_recall: 0.5337
Epoch 9/25 - 99/99 - 1s 8ms/step - loss: 0.9189 - recall: 0.7983 - val_loss: 0.6144 - val_recall: 0.5276
Epoch 10/25 - 99/99 - 1s 8ms/step - loss: 0.8895 - recall: 0.7899 - val_loss: 0.6419 - val_recall: 0.5583
Epoch 11/25 - 99/99 - 0s 5ms/step - loss: 0.8411 - recall: 0.8052 - val_loss: 0.6529 - val_recall: 0.5552
Epoch 12/25 - 99/99 - 0s 5ms/step - loss: 0.8032 - recall: 0.8198 - val_loss: 0.6572 - val_recall: 0.5429
Epoch 13/25 - 99/99 - 1s 5ms/step - loss: 0.7788 - recall: 0.8367 - val_loss: 0.6661 - val_recall: 0.5491
Epoch 14/25 - 99/99 - 0s 5ms/step - loss: 0.7505 - recall: 0.8489 - val_loss: 0.6761 - val_recall: 0.5460
Epoch 15/25 - 99/99 - 0s 5ms/step - loss: 0.7223 - recall: 0.8551 - val_loss: 0.6907 - val_recall: 0.5552
Epoch 16/25 - 99/99 - 0s 5ms/step - loss: 0.7059 - recall: 0.8428 - val_loss: 0.7134 - val_recall: 0.5706
Epoch 17/25 - 99/99 - 0s 5ms/step - loss: 0.6908 - recall: 0.8520 - val_loss: 0.7174 - val_recall: 0.5460
Epoch 18/25 - 99/99 - 0s 5ms/step - loss: 0.6789 - recall: 0.8512 - val_loss: 0.7540 - val_recall: 0.5613
Epoch 19/25 - 99/99 - 1s 6ms/step - loss: 0.6697 - recall: 0.8466 - val_loss: 0.7699 - val_recall: 0.5644
Epoch 20/25 - 99/99 - 1s 6ms/step - loss: 0.6577 - recall: 0.8489 - val_loss: 0.7893 - val_recall: 0.5675
Epoch 21/25 - 99/99 - 0s 5ms/step - loss: 0.6365 - recall: 0.8597 - val_loss: 0.8050 - val_recall: 0.5675
Epoch 22/25 - 99/99 - 1s 5ms/step - loss: 0.6297 - recall: 0.8689 - val_loss: 0.8064 - val_recall: 0.5644
Epoch 23/25 - 99/99 - 0s 5ms/step - loss: 0.6162 - recall: 0.8543 - val_loss: 0.8462 - val_recall: 0.5644
Epoch 24/25 - 99/99 - 1s 5ms/step - loss: 0.6175 - recall: 0.8673 - val_loss: 0.8495 - val_recall: 0.5644
Epoch 25/25 - 99/99 - 0s 5ms/step - loss: 0.6057 - recall: 0.8673 - val_loss: 0.8617 - val_recall: 0.5675
print("Time taken in seconds ",end-start)
Time taken in seconds 17.52609157562256
plot(history,'loss')
y_train_pred = model1.predict(X_train)
y_train_pred = (y_train_pred > 0.5)
y_train_pred
200/200 [==============================] - 0s 2ms/step
array([[ True],
[False],
[False],
...,
[False],
[ True],
[ True]])
y_valid_pred = model1.predict(X_valid)
y_valid_pred = (y_valid_pred > 0.5)
y_valid_pred
50/50 [==============================] - 0s 2ms/step
array([[False],
[False],
[False],
...,
[False],
[ True],
[ True]])
cl_rp = classification_report(y_train, y_train_pred)
print(cl_rp)
precision recall f1-score support
0 0.98 0.86 0.92 5096
1 0.63 0.92 0.75 1304
accuracy 0.87 6400
macro avg 0.80 0.89 0.83 6400
weighted avg 0.91 0.87 0.88 6400
cl_rp = classification_report(y_valid, y_valid_pred)
print(cl_rp)
precision recall f1-score support
0 0.87 0.71 0.78 1274
1 0.33 0.57 0.42 326
accuracy 0.68 1600
macro avg 0.60 0.64 0.60 1600
weighted avg 0.76 0.68 0.71 1600
modelName="NN with Adam and dropout"
train_data.loc[modelName] = recall_score(y_train, y_train_pred)
valid_data.loc[modelName] = recall_score(y_valid, y_valid_pred)
make_confusion_matrix(y_train, y_train_pred)
make_confusion_matrix(y_valid, y_valid_pred)
NN with Adam and dropout lowered training recall but improved validation recall. This makes it the best model so far.
smote = SMOTE(random_state=42) # Create the SMOTE object
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train) # Resample the data
print('After UpSampling, the shape of train_X: {}'.format(X_train_smote.shape))
print('After UpSampling, the shape of train_y: {} \n'.format(y_train_smote.shape))
After UpSampling, the shape of train_X: (10192, 2943)
After UpSampling, the shape of train_y: (10192,)
backend.clear_session()
np.random.seed(2)
random.seed(2)
tf.random.set_seed(2)
#Initializing the neural network
model2 = Sequential()
model2.add(Dense(70,activation="relu",input_dim=X_train_smote.shape[1]))
model2.add(Dense(17,activation="relu"))
model2.add(Dense(17,activation="relu"))
model2.add(Dense(1,activation="sigmoid"))
model2.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 70) 206080
dense_1 (Dense) (None, 17) 1207
dense_2 (Dense) (None, 17) 306
dense_3 (Dense) (None, 1) 18
=================================================================
Total params: 207611 (810.98 KB)
Trainable params: 207611 (810.98 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
optimizer = tf.keras.optimizers.SGD(0.001) # defining SGD with a 0.001 learning rate as the optimizer to be used
metric = keras.metrics.Recall()
model2.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=[metric])
start = time.time()
history = model2.fit(X_train_smote, y_train_smote, validation_data=(X_valid,y_valid) , batch_size=batch_size, epochs=epochs,class_weight=cw_dict)
end=time.time()
Epoch 1/25 - 157/157 - 2s 7ms/step - loss: 1.8193 - recall: 0.9541 - val_loss: 0.9011 - val_recall: 0.9847
Epoch 2/25 - 157/157 - 1s 5ms/step - loss: 1.5489 - recall: 0.9984 - val_loss: 0.9720 - val_recall: 0.9939
Epoch 3/25 - 157/157 - 1s 5ms/step - loss: 1.4393 - recall: 0.9992 - val_loss: 0.9848 - val_recall: 0.9939
Epoch 4/25 - 157/157 - 1s 5ms/step - loss: 1.3738 - recall: 0.9994 - val_loss: 0.9814 - val_recall: 0.9908
Epoch 5/25 - 157/157 - 1s 4ms/step - loss: 1.3222 - recall: 0.9992 - val_loss: 0.9716 - val_recall: 0.9908
Epoch 6/25 - 157/157 - 1s 5ms/step - loss: 1.2779 - recall: 0.9984 - val_loss: 0.9586 - val_recall: 0.9847
Epoch 7/25 - 157/157 - 1s 5ms/step - loss: 1.2378 - recall: 0.9978 - val_loss: 0.9481 - val_recall: 0.9816
Epoch 8/25 - 157/157 - 1s 5ms/step - loss: 1.2005 - recall: 0.9971 - val_loss: 0.9340 - val_recall: 0.9724
Epoch 9/25 - 157/157 - 1s 5ms/step - loss: 1.1650 - recall: 0.9961 - val_loss: 0.9282 - val_recall: 0.9663
Epoch 10/25 - 157/157 - 1s 5ms/step - loss: 1.1307 - recall: 0.9955 - val_loss: 0.9187 - val_recall: 0.9571
Epoch 11/25 - 157/157 - 1s 5ms/step - loss: 1.0974 - recall: 0.9941 - val_loss: 0.9095 - val_recall: 0.9479
Epoch 12/25 - 157/157 - 1s 6ms/step - loss: 1.0647 - recall: 0.9922 - val_loss: 0.8959 - val_recall: 0.9202
Epoch 13/25 - 157/157 - 1s 8ms/step - loss: 1.0328 - recall: 0.9896 - val_loss: 0.8892 - val_recall: 0.8834
Epoch 14/25 - 157/157 - 1s 8ms/step - loss: 1.0014 - recall: 0.9869 - val_loss: 0.8786 - val_recall: 0.8374
Epoch 15/25 - 157/157 - 1s 5ms/step - loss: 0.9705 - recall: 0.9843 - val_loss: 0.8706 - val_recall: 0.8067
Epoch 16/25 - 157/157 - 1s 5ms/step - loss: 0.9403 - recall: 0.9843 - val_loss: 0.8574 - val_recall: 0.7791
Epoch 17/25 - 157/157 - 1s 5ms/step - loss: 0.9111 - recall: 0.9821 - val_loss: 0.8514 - val_recall: 0.7699
Epoch 18/25 - 157/157 - 1s 5ms/step - loss: 0.8824 - recall: 0.9816 - val_loss: 0.8408 - val_recall: 0.7270
Epoch 19/25 - 157/157 - 1s 5ms/step - loss: 0.8546 - recall: 0.9798 - val_loss: 0.8354 - val_recall: 0.7147
Epoch 20/25 - 157/157 - 1s 5ms/step - loss: 0.8279 - recall: 0.9788 - val_loss: 0.8275 - val_recall: 0.6994
Epoch 21/25 - 157/157 - 1s 5ms/step - loss: 0.8022 - recall: 0.9792 - val_loss: 0.8160 - val_recall: 0.6810
Epoch 22/25 - 157/157 - 1s 7ms/step - loss: 0.7777 - recall: 0.9774 - val_loss: 0.8106 - val_recall: 0.6656
Epoch 23/25 - 157/157 - 1s 9ms/step - loss: 0.7546 - recall: 0.9765 - val_loss: 0.8009 - val_recall: 0.6258
Epoch 24/25 - 157/157 - 2s 10ms/step - loss: 0.7331 - recall: 0.9772 - val_loss: 0.7943 - val_recall: 0.6043
Epoch 25/25 - 157/157 - 2s 11ms/step - loss: 0.7129 - recall: 0.9761 - val_loss: 0.7885 - val_recall: 0.5951
print("Time taken in seconds ",end-start)
Time taken in seconds 42.71089267730713
plot(history,'loss')
y_train_pred = model2.predict(X_train_smote)
y_train_pred = (y_train_pred > 0.5)
y_train_pred
319/319 [==============================] - 1s 2ms/step
array([[ True],
[False],
[False],
...,
[ True],
[ True],
[ True]])
y_valid_pred = model2.predict(X_valid)
y_valid_pred = (y_valid_pred > 0.5)
y_valid_pred
50/50 [==============================] - 0s 2ms/step
array([[False],
[ True],
[False],
...,
[ True],
[ True],
[ True]])
cl_rp = classification_report(y_train_smote, y_train_pred)
print(cl_rp)
precision recall f1-score support
0 0.97 0.67 0.79 5096
1 0.75 0.98 0.85 5096
accuracy 0.82 10192
macro avg 0.86 0.82 0.82 10192
weighted avg 0.86 0.82 0.82 10192
cl_rp = classification_report(y_valid, y_valid_pred)
print(cl_rp)
precision recall f1-score support
0 0.84 0.54 0.66 1274
1 0.25 0.60 0.35 326
accuracy 0.56 1600
macro avg 0.55 0.57 0.51 1600
weighted avg 0.72 0.56 0.60 1600
modelName="NN with SMOTE and SGD"
train_data.loc[modelName] = recall_score(y_train_smote, y_train_pred)
valid_data.loc[modelName] = recall_score(y_valid, y_valid_pred)
make_confusion_matrix(y_train_smote, y_train_pred)
make_confusion_matrix(y_valid, y_valid_pred)
NN with SMOTE and SGD showed validation recall decaying steadily across epochs (from 0.98 down to 0.60), so it is not an improvement over our previous model.
backend.clear_session()
np.random.seed(2)
random.seed(2)
tf.random.set_seed(2)
#Initializing the neural network
model3 = Sequential()
model3.add(Dense(70,activation="relu",input_dim=X_train_smote.shape[1]))
model3.add(Dense(17,activation="relu"))
model3.add(Dense(17,activation="relu"))
model3.add(Dense(1,activation="sigmoid"))
model3.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 70) 206080
dense_1 (Dense) (None, 17) 1207
dense_2 (Dense) (None, 17) 306
dense_3 (Dense) (None, 1) 18
=================================================================
Total params: 207611 (810.98 KB)
Trainable params: 207611 (810.98 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
optimizer = tf.keras.optimizers.Adam() # defining Adam as the optimizer to be used
metric = keras.metrics.Recall()
model3.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=[metric])
start = time.time()
history = model3.fit(X_train_smote, y_train_smote, validation_data=(X_valid,y_valid) , batch_size=batch_size, epochs=epochs,class_weight=cw_dict)
end=time.time()
Epoch 1/25 - 157/157 - 2s 7ms/step - loss: 1.1999 - recall: 0.9700 - val_loss: 0.8121 - val_recall: 0.6043
Epoch 2/25 - 157/157 - 1s 5ms/step - loss: 0.7605 - recall: 0.9557 - val_loss: 0.7599 - val_recall: 0.5644
Epoch 3/25 - 157/157 - 1s 7ms/step - loss: 0.6164 - recall: 0.9696 - val_loss: 0.7111 - val_recall: 0.5123
Epoch 4/25 - 157/157 - 1s 9ms/step - loss: 0.5577 - recall: 0.9717 - val_loss: 0.7791 - val_recall: 0.5491
Epoch 5/25 - 157/157 - 1s 9ms/step - loss: 0.5108 - recall: 0.9788 - val_loss: 0.8028 - val_recall: 0.5890
Epoch 6/25 - 157/157 - 1s 9ms/step - loss: 0.4616 - recall: 0.9825 - val_loss: 0.7680 - val_recall: 0.5184
Epoch 7/25 - 157/157 - 1s 6ms/step - loss: 0.4203 - recall: 0.9851 - val_loss: 0.8012 - val_recall: 0.4908
Epoch 8/25 - 157/157 - 1s 5ms/step - loss: 0.3778 - recall: 0.9857 - val_loss: 0.8295 - val_recall: 0.4939
Epoch 9/25 - 157/157 - 1s 5ms/step - loss: 0.3336 - recall: 0.9884 - val_loss: 0.8698 - val_recall: 0.4724
Epoch 10/25 - 157/157 - 1s 5ms/step - loss: 0.2921 - recall: 0.9908 - val_loss: 0.8712 - val_recall: 0.4110
Epoch 11/25 - 157/157 - 1s 5ms/step - loss: 0.2457 - recall: 0.9925 - val_loss: 0.9552 - val_recall: 0.4233
Epoch 12/25 - 157/157 - 1s 5ms/step - loss: 0.2168 - recall: 0.9929 - val_loss: 0.9588 - val_recall: 0.3650
Epoch 13/25 - 157/157 - 1s 5ms/step - loss: 0.1742 - recall: 0.9965 - val_loss: 1.0398 - val_recall: 0.3620
Epoch 14/25 - 157/157 - 1s 5ms/step - loss: 0.1450 - recall: 0.9973 - val_loss: 1.0610 - val_recall: 0.3190
Epoch 15/25 - 157/157 - 1s 5ms/step - loss: 0.1255 - recall: 0.9967 - val_loss: 1.1189 - val_recall: 0.3190
Epoch 16/25 - 157/157 - 1s 5ms/step - loss: 0.1071 - recall: 0.9969 - val_loss: 1.1895 - val_recall: 0.2883
Epoch 17/25 - 157/157 - 1s 5ms/step - loss: 0.0910 - recall: 0.9978 - val_loss: 1.2393 - val_recall: 0.2914
Epoch 18/25 - 157/157 - 1s 5ms/step - loss: 0.0801 - recall: 0.9974 - val_loss: 1.2890 - val_recall: 0.2914
Epoch 19/25 - 157/157 - 1s 5ms/step - loss: 0.0643 - recall: 0.9978 - val_loss: 1.3613 - val_recall: 0.3129
Epoch 20/25 - 157/157 - 2s 10ms/step - loss: 0.0588 - recall: 0.9992 - val_loss: 1.3809 - val_recall: 0.2791
Epoch 21/25 - 157/157 - 2s 12ms/step - loss: 0.0487 - recall: 0.9994 - val_loss: 1.4757 - val_recall: 0.2883
Epoch 22/25 - 157/157 - 2s 10ms/step - loss: 0.0439 - recall: 0.9988 - val_loss: 1.5160 - val_recall: 0.2822
Epoch 23/25 - 157/157 - 1s 8ms/step - loss: 0.0482 - recall: 0.9988 - val_loss: 1.5498 - val_recall: 0.3067
Epoch 24/25 - 157/157 - 1s 5ms/step - loss: 0.0415 - recall: 0.9994 - val_loss: 1.6068 - val_recall: 0.3067
Epoch 25/25 - 157/157 - 1s 5ms/step - loss: 0.0323 - recall: 0.9996 - val_loss: 1.6510 - val_recall: 0.2669
print("Time taken in seconds ",end-start)
Time taken in seconds 27.022613763809204
plot(history,'loss')
y_train_pred = model3.predict(X_train_smote)
y_train_pred = (y_train_pred > 0.5)
y_train_pred
319/319 [==============================] - 1s 2ms/step
array([[ True],
[False],
[False],
...,
[ True],
[ True],
[ True]])
y_valid_pred = model3.predict(X_valid)
y_valid_pred = (y_valid_pred > 0.5)
y_valid_pred
50/50 [==============================] - 0s 2ms/step
array([[False],
[False],
[False],
...,
[False],
[ True],
[False]])
cl_rp = classification_report(y_train_smote, y_train_pred)
print(cl_rp)
precision recall f1-score support
0 1.00 0.99 1.00 5096
1 0.99 1.00 1.00 5096
accuracy 1.00 10192
macro avg 1.00 1.00 1.00 10192
weighted avg 1.00 1.00 1.00 10192
cl_rp = classification_report(y_valid, y_valid_pred)
print(cl_rp)
precision recall f1-score support
0 0.82 0.88 0.85 1274
1 0.36 0.27 0.31 326
accuracy 0.76 1600
macro avg 0.59 0.57 0.58 1600
weighted avg 0.73 0.76 0.74 1600
modelName="NN with SMOTE and Adam"
train_data.loc[modelName] = recall_score(y_train_smote, y_train_pred)
valid_data.loc[modelName] = recall_score(y_valid, y_valid_pred)
make_confusion_matrix(y_train_smote, y_train_pred)
make_confusion_matrix(y_valid, y_valid_pred)
NN with SMOTE and Adam achieved a recall of 1.00 for exited customers on the training set but only 0.27 on the validation set. The model is severely overfit and is not a good candidate.
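The widening gap between training and validation recall is a textbook overfitting pattern. One standard mitigation, not used in this notebook, is Keras's EarlyStopping callback, which halts training once the monitored validation metric stops improving and can roll back to the best weights. A sketch (the parameter values are illustrative):

```python
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',         # watch validation loss each epoch
    patience=3,                 # tolerate 3 epochs without improvement
    restore_best_weights=True,  # revert to the best-scoring epoch
)

# It would be passed to fit() alongside the other arguments, e.g.:
# model3.fit(X_train_smote, y_train_smote,
#            validation_data=(X_valid, y_valid),
#            epochs=25, callbacks=[early_stop])
```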
backend.clear_session()
np.random.seed(2)
random.seed(2)
tf.random.set_seed(2)
#Initializing the neural network
model4 = Sequential()
model4.add(Dense(35,activation="relu",input_dim=X_train_smote.shape[1]))
model4.add(Dropout(0.4))
model4.add(Dense(7,activation="relu"))
model4.add(Dropout(0.2))
model4.add(Dense(1,activation="sigmoid"))
model4.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 35) 103040
dropout (Dropout) (None, 35) 0
dense_1 (Dense) (None, 7) 252
dropout_1 (Dropout) (None, 7) 0
dense_2 (Dense) (None, 1) 8
=================================================================
Total params: 103300 (403.52 KB)
Trainable params: 103300 (403.52 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
optimizer = tf.keras.optimizers.Adam() # defining Adam as the optimizer to be used
metric = keras.metrics.Recall()
model4.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=[metric])
start = time.time()
history = model4.fit(X_train_smote, y_train_smote, validation_data=(X_valid,y_valid) , batch_size=batch_size, epochs=epochs,class_weight=cw_dict)
end=time.time()
Epoch 1/25 - 157/157 - 2s 7ms/step - loss: 1.4658 - recall: 0.9700 - val_loss: 0.8654 - val_recall: 0.8344
Epoch 2/25 - 157/157 - 1s 5ms/step - loss: 1.0600 - recall: 0.9700 - val_loss: 0.7939 - val_recall: 0.6656
Epoch 3/25 - 157/157 - 1s 7ms/step - loss: 0.8777 - recall: 0.9625 - val_loss: 0.7825 - val_recall: 0.6043
Epoch 4/25 - 157/157 - 1s 8ms/step - loss: 0.7886 - recall: 0.9609 - val_loss: 0.7979 - val_recall: 0.5828
Epoch 5/25 - 157/157 - 1s 6ms/step - loss: 0.7466 - recall: 0.9584 - val_loss: 0.8239 - val_recall: 0.5798
Epoch 6/25 - 157/157 - 1s 5ms/step - loss: 0.6920 - recall: 0.9619 - val_loss: 0.7807 - val_recall: 0.5583
Epoch 7/25 - 157/157 - 1s 5ms/step - loss: 0.6774 - recall: 0.9657 - val_loss: 0.7815 - val_recall: 0.5521
Epoch 8/25 - 157/157 - 1s 5ms/step - loss: 0.6476 - recall: 0.9641 - val_loss: 0.7802 - val_recall: 0.5460
Epoch 9/25 - 157/157 - 1s 5ms/step - loss: 0.6245 - recall: 0.9659 - val_loss: 0.8246 - val_recall: 0.5736
Epoch 10/25 - 157/157 - 1s 5ms/step - loss: 0.6299 - recall: 0.9674 - val_loss: 0.7966 - val_recall: 0.5767
Epoch 11/25 - 157/157 - 1s 5ms/step - loss: 0.6060 - recall: 0.9721 - val_loss: 0.8023 - val_recall: 0.5675
Epoch 12/25 - 157/157 - 1s 5ms/step - loss: 0.5739 - recall: 0.9729 - val_loss: 0.8054 - val_recall: 0.5736
Epoch 13/25 - 157/157 - 1s 5ms/step - loss: 0.5694 - recall: 0.9719 - val_loss: 0.8285 - val_recall: 0.5828
Epoch 14/25 - 157/157 - 1s 5ms/step - loss: 0.5585 - recall: 0.9725 - val_loss: 0.8253 - val_recall: 0.5859
Epoch 15/25 - 157/157 - 1s 5ms/step - loss: 0.5630 - recall: 0.9712 - val_loss: 0.8167 - val_recall: 0.5951
Epoch 16/25 - 157/157 - 1s 5ms/step - loss: 0.5447 - recall: 0.9753 - val_loss: 0.8226 - val_recall: 0.5951
Epoch 17/25 - 157/157 - 1s 5ms/step - loss: 0.5330 - recall: 0.9765 - val_loss: 0.8409 - val_recall: 0.6012
Epoch 18/25 - 157/157 - 1s 6ms/step - loss: 0.5209 - recall: 0.9770 - val_loss: 0.8380 - val_recall: 0.5798
Epoch 19/25 - 157/157 - 1s 7ms/step - loss: 0.5237 - recall: 0.9759 - val_loss: 0.8490 - val_recall: 0.5828
Epoch 20/25 - 157/157 - 1s 8ms/step - loss: 0.5082 - recall: 0.9763 - val_loss: 0.8733 - val_recall: 0.5982
Epoch 21/25 - 157/157 - 1s 5ms/step - loss: 0.4849 - recall: 0.9810 - val_loss: 0.8725 - val_recall: 0.5798
Epoch 22/25 - 157/157 - 1s 5ms/step - loss: 0.4727 - recall: 0.9784 - val_loss: 0.8953 - val_recall: 0.5675
Epoch 23/25 - 157/157 - 1s 5ms/step - loss: 0.4539 - recall: 0.9806 - val_loss: 0.8978 - val_recall: 0.5767
Epoch 24/25 - 157/157 - 1s 5ms/step - loss: 0.4542 - recall: 0.9802 - val_loss: 0.8677 - val_recall: 0.5521
Epoch 25/25 - 157/157 - 1s 5ms/step - loss: 0.4385 - recall: 0.9796 - val_loss: 0.8726 - val_recall: 0.5337
print("Time taken in seconds ",end-start)
Time taken in seconds 22.912023544311523
plot(history,'loss')
y_train_pred = model4.predict(X_train_smote)
y_train_pred = (y_train_pred > 0.5)
y_train_pred
319/319 [==============================] - 1s 2ms/step
array([[ True],
[False],
[False],
...,
[ True],
[ True],
[ True]])
y_valid_pred = model4.predict(X_valid)
y_valid_pred = (y_valid_pred > 0.5)
y_valid_pred
50/50 [==============================] - 0s 2ms/step
array([[False],
[False],
[False],
...,
[False],
[ True],
[ True]])
cl_rp = classification_report(y_train_smote, y_train_pred)
print(cl_rp)
precision recall f1-score support
0 0.98 0.89 0.93 5096
1 0.90 0.99 0.94 5096
accuracy 0.94 10192
macro avg 0.94 0.94 0.94 10192
weighted avg 0.94 0.94 0.94 10192
cl_rp = classification_report(y_valid, y_valid_pred)
print(cl_rp)
precision recall f1-score support
0 0.86 0.75 0.80 1274
1 0.35 0.53 0.43 326
accuracy 0.71 1600
macro avg 0.61 0.64 0.61 1600
weighted avg 0.76 0.71 0.73 1600
modelName="NN with SMOTE, Adam and dropout"
train_data.loc[modelName] = recall_score(y_train_smote, y_train_pred)
valid_data.loc[modelName] = recall_score(y_valid, y_valid_pred)
make_confusion_matrix(y_train_smote, y_train_pred)
make_confusion_matrix(y_valid, y_valid_pred)
NN with SMOTE, Adam and dropout has low precision but reasonably high recall on the training set. However, there is still a significant drop in performance on the validation set.
print("Training data")
train_data
Training data
| | recall |
|---|---|
| NN SGD | 0.815951 |
| NN Adam | 0.999233 |
| NN with Adam and dropout | 0.915644 |
| NN with SMOTE and SGD | 0.978807 |
| NN with SMOTE and Adam | 0.999608 |
| NN with SMOTE, Adam and dropout | 0.985871 |
print("Validation data")
valid_data
Validation data
| | recall |
|---|---|
| NN SGD | 0.386503 |
| NN Adam | 0.533742 |
| NN with Adam and dropout | 0.567485 |
| NN with SMOTE and SGD | 0.595092 |
| NN with SMOTE and Adam | 0.266871 |
| NN with SMOTE, Adam and dropout | 0.533742 |
diff = train_data-valid_data
diff
| | recall |
|---|---|
| NN SGD | 0.429448 |
| NN Adam | 0.465491 |
| NN with Adam and dropout | 0.348160 |
| NN with SMOTE and SGD | 0.383715 |
| NN with SMOTE and Adam | 0.732736 |
| NN with SMOTE, Adam and dropout | 0.452129 |
NN with SMOTE and Adam performs the best on the training data but very poorly on the validation set. NN SGD has a smaller train-validation gap than NN Adam, but its recall was never strong. NN Adam reached 0.99 recall on the training set, yet its validation recall dropped sharply.
NN with Adam and dropout scores well on the training set, holds a reasonably good recall on the validation set, and has the smallest gap between training and validation performance. Therefore, I chose NN with Adam and dropout as my final model.
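Since the business objective is recall on churners, the 0.5 cutoff used throughout is itself a tunable knob: lowering it trades precision for recall. A small sketch on made-up scores (the arrays are illustrative, not real model output):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                      # 1 = exited
scores = np.array([0.2, 0.4, 0.45, 0.6, 0.3, 0.55, 0.7, 0.35])   # predicted probabilities

for t in (0.3, 0.4, 0.5):
    pred = scores > t
    print(f"threshold={t}: recall={recall_score(y_true, pred):.2f}, "
          f"precision={precision_score(y_true, pred):.2f}")
```

On the real data the threshold would be chosen on the validation set, never on the test set.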
y_test_pred = model4.predict(X_test)
y_test_pred = (y_test_pred > 0.5)
print(y_test_pred)
63/63 [==============================] - 0s 4ms/step
[[False]
 [ True]
 [False]
 ...
 [ True]
 [False]
 [False]]
make_confusion_matrix(y_test,y_test_pred)
cl_rp = classification_report(y_test, y_test_pred)
print(cl_rp)
precision recall f1-score support
0 0.86 0.75 0.80 1593
1 0.35 0.53 0.42 407
accuracy 0.70 2000
macro avg 0.60 0.64 0.61 2000
weighted avg 0.76 0.70 0.72 2000
Observations
Business Recommendations
Target younger customers to join as they are more likely to stay.
Encourage customers to stay active with the help of monthly deals and coupons.
Provide offers like discounts to luxury brands and free subscriptions to movie apps, music apps, etc., through brand partnerships to incentivize customers to stay.
Target customers located in France.